[WIP] testing & testability improvements #9866

Open
ReubenBond wants to merge 6 commits into dotnet:main from ReubenBond:fix/test-flakiness/1

ReubenBond commented Jan 3, 2026

(This is just a WIP/experiment at this stage; opening a PR just to run CI.)

This pull request introduces a new set of diagnostic event definitions for Orleans, covering grain lifecycle, membership, placement, and silo/client lifecycle events. These changes provide a structured way for advanced users to observe and react to important internal events in Orleans clusters, primarily for diagnostics, monitoring, and simulation testing scenarios. The additions include public static classes for event names, listener names, and strongly-typed event payload records for each diagnostic area.

The most important changes are:

Grain Diagnostics

  • Added OrleansGrainDiagnostics class with listener and event names for grain activation lifecycle events, along with corresponding payload records (GrainCreatedEvent, GrainActivatedEvent, GrainDeactivatingEvent, GrainDeactivatedEvent).

Lifecycle Diagnostics

  • Added OrleansLifecycleDiagnostics class for silo and client lifecycle events, including event names for stage and observer transitions, and detailed event payload records (such as LifecycleStageStartingEvent, LifecycleObserverFailedEvent, etc.).

Membership Diagnostics

  • Added OrleansMembershipDiagnostics class for cluster membership events, providing event names and payload records for silo status changes, membership view changes, suspicions, and cluster join/leave events.

Placement and Load Statistics Diagnostics

  • Added OrleansPlacementDiagnostics class for placement and silo load statistics events, with event names and payload records for statistics publication, reception, cluster-wide refresh, and removal.

Copilot AI review requested due to automatic review settings January 3, 2026 20:55

Copilot AI left a comment

Pull request overview

This PR introduces comprehensive testing infrastructure improvements for Orleans by adding diagnostic event collection, FakeTimeProvider integration for deterministic time control, and event-driven waiting patterns. The changes enable faster, more reliable tests by replacing polling/sleep-based waiting with event-driven approaches and virtual time control.

Key Changes:

  • Added diagnostic observer infrastructure (GrainDiagnosticObserver, TimerDiagnosticObserver, ReminderDiagnosticObserver, etc.) for event-driven test waiting
  • Integrated FakeTimeProvider across test infrastructure for deterministic time control in timer/reminder tests
  • Replaced Thread.Sleep/Task.Delay polling patterns with event-driven waiting throughout test suite
  • Enhanced LeaseBasedQueueBalancer with diagnostic event emission for streaming tests

Reviewed changes

Copilot reviewed 87 out of 87 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| test/TestInfrastructure/TestExtensions/*DiagnosticObserver.cs | New diagnostic observer classes for event-driven test waiting |
| src/Orleans.TestingHost/Logging/InMemoryLoggerProvider.cs | New in-memory logging infrastructure for test log capture |
| test/TesterInternal/TimerTests/ReminderTests_*.cs | Converted to FakeTimeProvider + event-driven waiting |
| test/TesterInternal/ActivationsLifeCycleTests/*.cs | Replaced Task.Delay with FakeTimeProvider and event-driven deactivation waiting |
| test/DefaultCluster.Tests/TimerOrleansTest.cs | Replaced polling loops with event-driven timer tick waiting |
| src/Orleans.Streaming/QueueBalancer/LeaseBasedQueueBalancer.cs | Added diagnostic event emission for queue balancer changes |
| src/Orleans.Runtime/Timers/AsyncTimerFactory.cs | Integrated TimeProvider for testable timer creation |
| test/Grains/*/PlacementTestGrain.cs | Added methods for deterministic overload detector testing |
| test/**/LeaseBasedQueueBalancer.cs | Fixed race condition with proper two-phase latching pattern |

ReubenBond force-pushed the fix/test-flakiness/1 branch 2 times, most recently from 1ae919e to 41f4211 on January 6, 2026 21:36
ReubenBond requested a review from Copilot on January 6, 2026 21:37

Copilot AI left a comment

Pull request overview

Copilot reviewed 94 out of 94 changed files in this pull request and generated no new comments.

ReubenBond force-pushed the fix/test-flakiness/1 branch from 1c982a1 to 9f72802 on January 7, 2026 15:22
ReubenBond marked this pull request as draft on January 20, 2026 23:58
ReubenBond and others added 6 commits April 1, 2026 10:33
Add structured DiagnosticListener/DiagnosticSource event definitions covering:
- Grain lifecycle (Created, Activated, Deactivating, Deactivated)
- Silo/Client lifecycle (StageStarting/Completed/Failed, ObserverStarting/Completed/Failed)
- Membership (SiloStatusChanged, ViewChanged, SiloSuspected, SiloDeclaredDead)
- Placement (StatisticsPublished/Received, ClusterStatisticsRefreshed)
- Rebalancer (CycleStart/Stop, SessionStart/Stop)
- Reminders (Registered, Unregistered, TickFiring/Completed/Failed)
- Timers (TickStart/Stop, Created, Disposed)
- Streaming (MessageDelivered, StreamInactive, SubscriptionAdded/Removed, QueueLeases)

Each event uses typed record payloads emitted via DiagnosticListener.Write()
behind IsEnabled() guards. These enable deterministic test waiting and
advanced diagnostics scenarios.
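
The guard-then-write pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the listener name, event name, `ExampleGrainActivated` record, and `CollectingObserver` are hypothetical stand-ins, and only `DiagnosticListener.IsEnabled`, `Write`, and `Subscribe` are real .NET APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical payload record; the PR's actual record shapes are not shown in this thread.
public sealed record ExampleGrainActivated(string GrainId, TimeSpan Elapsed);

public static class ExampleGrainDiagnostics
{
    // Hypothetical listener and event names, chosen for illustration only.
    public const string ListenerName = "Example.Grains";
    public const string Activated = "Activated";

    public static readonly DiagnosticListener Listener = new(ListenerName);

    public static void OnActivated(string grainId, TimeSpan elapsed)
    {
        // IsEnabled is false until someone subscribes, so the payload
        // is only allocated when an observer is actually listening.
        if (Listener.IsEnabled(Activated))
        {
            Listener.Write(Activated, new ExampleGrainActivated(grainId, elapsed));
        }
    }
}

// Minimal observer that records every payload it receives.
public sealed class CollectingObserver : IObserver<KeyValuePair<string, object?>>
{
    public List<object?> Payloads { get; } = new();
    public void OnNext(KeyValuePair<string, object?> evt) => Payloads.Add(evt.Value);
    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

public static class ExampleGrainDiagnosticsDemo
{
    public static int Run()
    {
        // No subscriber yet: the guard skips the write entirely.
        ExampleGrainDiagnostics.OnActivated("grain-0", TimeSpan.Zero);

        var observer = new CollectingObserver();
        using (ExampleGrainDiagnostics.Listener.Subscribe(observer))
        {
            ExampleGrainDiagnostics.OnActivated("grain-1", TimeSpan.FromMilliseconds(3));
        }

        // Only the event emitted while subscribed was observed.
        return observer.Payloads.Count;
    }
}
```

A typed record keeps the payload self-describing for observers, while the guard means unobserved code paths pay only for the `IsEnabled` check.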

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add DiagnosticListener-based event emission to core runtime components:

- SiloLifecycleSubject: lifecycle stage and observer start/stop/fail events
- MembershipTableManager: silo status changes, view changes, join/active/dead
- DeploymentLoadPublisher: statistics published/received/refreshed/removed
- GrainTimer: timer tick start/stop, created, disposed

All events are guarded by IsEnabled() checks to avoid overhead when
no listener is subscribed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Inject TimeProvider into AsyncTimer and AsyncTimerFactory, replacing
direct DateTime.UtcNow and Task.Delay calls with TimeProvider-based
alternatives. This enables FakeTimeProvider usage in tests for
deterministic timing control.
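
As a rough sketch of what TimeProvider injection buys a test, the snippet below uses a hypothetical `ExampleDelayer` class (not the PR's `AsyncTimer`) that reads the clock and delays through an injected `TimeProvider`. It assumes .NET 8 or later and the `Microsoft.Extensions.TimeProvider.Testing` NuGet package, which supplies `FakeTimeProvider`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Time.Testing; // NuGet: Microsoft.Extensions.TimeProvider.Testing

// Hypothetical component for illustration; it is not the PR's AsyncTimer.
public sealed class ExampleDelayer
{
    private readonly TimeProvider _timeProvider;

    public ExampleDelayer(TimeProvider timeProvider) => _timeProvider = timeProvider;

    // Waits for one period and reports how much time elapsed on the injected clock.
    public async Task<TimeSpan> WaitOnePeriodAsync(TimeSpan period, CancellationToken ct = default)
    {
        // GetUtcNow() stands in for DateTime.UtcNow.
        DateTimeOffset start = _timeProvider.GetUtcNow();

        // The TimeProvider overload of Task.Delay stands in for Task.Delay(period).
        await Task.Delay(period, _timeProvider, ct);

        return _timeProvider.GetUtcNow() - start;
    }
}

public static class ExampleDelayerDemo
{
    public static async Task<TimeSpan> RunAsync()
    {
        var fakeTime = new FakeTimeProvider();
        var delayer = new ExampleDelayer(fakeTime);

        // The method runs up to its first await, registering a timer on the fake clock.
        Task<TimeSpan> pending = delayer.WaitOnePeriodAsync(TimeSpan.FromMinutes(5));

        // No wall-clock time passes; advancing the fake clock fires the pending delay.
        fakeTime.Advance(TimeSpan.FromMinutes(5));

        return await pending;
    }
}
```

Production code would pass `TimeProvider.System`, as the later commit does for the membership tests, so behavior outside tests stays the same.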

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add test infrastructure for event-driven test waiting:

Test observers (in TestExtensions):
- GrainDiagnosticObserver: wait for grain created/activated/deactivated counts
- MembershipDiagnosticObserver: wait for silo status changes
- PlacementDiagnosticObserver: wait for statistics propagation
- RebalancerDiagnosticObserver: wait for rebalancing cycle counts
- ReminderDiagnosticObserver: wait for reminder tick counts
- TimerDiagnosticObserver: wait for timer tick counts

Test utilities (in Orleans.TestingHost):
- DiagnosticEventCollector: generic listener for any DiagnosticSource
- InMemoryLoggerProvider: log capture with TimeProvider support
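
One way an observer can "wait for N events" instead of polling is sketched below. `ExampleEventWaiter` and the listener and event names are hypothetical; the PR's observers and `DiagnosticEventCollector` may be structured quite differently. The sketch assumes .NET 6 or later for `Task.WaitAsync`.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: completes a task once a named event has been seen `target` times.
public sealed class ExampleEventWaiter : IObserver<KeyValuePair<string, object?>>, IDisposable
{
    private readonly string _eventName;
    private readonly int _target;
    private readonly TaskCompletionSource _reached =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
    private readonly IDisposable _subscription;
    private int _count;

    public ExampleEventWaiter(DiagnosticListener listener, string eventName, int target)
    {
        if (target <= 0) throw new ArgumentOutOfRangeException(nameof(target));
        _eventName = eventName;
        _target = target;
        _subscription = listener.Subscribe(this);
    }

    // The timeout only bounds the failure case; success returns as soon as the count is hit.
    public Task WaitAsync(TimeSpan timeout) => _reached.Task.WaitAsync(timeout);

    public void OnNext(KeyValuePair<string, object?> evt)
    {
        // Events may arrive on any thread, so count with Interlocked.
        if (evt.Key == _eventName && Interlocked.Increment(ref _count) >= _target)
        {
            _reached.TrySetResult();
        }
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
    public void Dispose() => _subscription.Dispose();
}

public static class ExampleEventWaiterDemo
{
    public static async Task<bool> RunAsync()
    {
        using var listener = new DiagnosticListener("Example.Timers"); // hypothetical name
        using var waiter = new ExampleEventWaiter(listener, "TickStop", target: 3);

        // Stand-in for the runtime emitting three timer tick events.
        for (int i = 0; i < 3; i++)
        {
            if (listener.IsEnabled("TickStop"))
            {
                listener.Write("TickStop", i);
            }
        }

        await waiter.WaitAsync(TimeSpan.FromSeconds(5));
        return true;
    }
}
```

Compared with a `Task.Delay` polling loop, the test resumes as soon as the expected event count is reached and fails with a `TimeoutException` rather than a stale assertion if it never is.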

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Instrument LocalReminderService with diagnostic events:
- Registered/Unregistered on reminder lifecycle
- TickFiring/TickCompleted/TickFailed around reminder callbacks
- Inject TimeProvider for deterministic timing

Instrument streaming pipeline with diagnostic events:
- PersistentStreamPullingAgent: MessageDelivered, StreamInactive
- LeaseBasedQueueBalancer: QueueBalancerChanged, QueueLeasesAcquired/Released
- Thread TimeProvider through PersistentStreamPullingManager
- StreamConsumerCollection: accept explicit timestamp parameter

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Update all AsyncTimerFactory constructor calls in membership tests
to pass TimeProvider.System as the new required parameter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ReubenBond force-pushed the fix/test-flakiness/1 branch from 1616890 to 9a6f6b9 on April 1, 2026 18:03
ReubenBond marked this pull request as ready for review on April 1, 2026 22:55

2 participants